Variational Stochastic Gradient Descent
Abstract
In the Bayesian approach to probabilistic modeling of data, we select a model for the probabilities of the data that depends on a continuous vector of parameters. For a given data set, Bayes' theorem gives a probability distribution over the model parameters. Inference of outcomes and probabilities for new data can then be obtained by averaging over this parameter distribution, which is an intractable problem. In this paper we propose to use Variational Bayes (VB) to estimate a Gaussian posterior over the model parameters for a given Gaussian prior, with Bayesian updates in a form that resembles SGD rules. We show that, with incremental updates of the posterior for a selected sequence of data points and a given number of iterations, the variational approximations are defined by a trajectory in the space of Gaussian parameters. This trajectory depends on a starting point defined by the priors of the parameter distribution, which are the true hyperparameters. The same priors provide weight decay, or L2 regularization, for the training. A choice of L2 regularization parameters and a number of iterations therefore completely defines the learning rule for VB SGD optimization, unlike other methods with momentum (Duchi et al., 2011; Kingma & Ba, 2014; Zeiler, 2012), which require learning rates, regularization rates, and other settings to be selected separately. We consider the application of VB SGD to the important practical case of fast training of neural networks on very large data. The speedup is achieved by partitioning the data and training in parallel, and the resulting set of solutions obtained with VB SGD forms a Gaussian mixture. By applying VB SGD optimization to the Gaussian mixture, we can merge multiple neural networks of the same dimensions into a single new neural network that has almost the same performance as the original Gaussian mixture.
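The link between a Gaussian prior and L2 regularization mentioned in the abstract can be illustrated with a minimal sketch. This is ordinary MAP-style SGD with a zero-mean isotropic Gaussian prior, not the VB SGD update of the paper; the function names, the `prior_var` parameter, and the explicit learning rate `lr` are all assumptions made for illustration.

```python
import numpy as np


def neg_log_gaussian_prior(w, prior_var):
    """Negative log density of N(0, prior_var * I) evaluated at w."""
    d = w.size
    quadratic = 0.5 * np.sum(w ** 2) / prior_var
    normalizer = 0.5 * d * np.log(2.0 * np.pi * prior_var)
    return quadratic + normalizer


def l2_penalty(w, weight_decay):
    """Standard L2 penalty: 0.5 * weight_decay * ||w||^2."""
    return 0.5 * weight_decay * np.sum(w ** 2)


def map_sgd_step(w, grad_nll, prior_var, lr):
    """One gradient step on (data negative log-likelihood + negative log prior).

    The gradient of the prior term is w / prior_var, which is exactly a
    weight-decay term with coefficient 1 / prior_var.
    Illustrative only: `lr` is a hypothetical explicit learning rate.
    """
    return w - lr * (grad_nll + w / prior_var)
```

Under these assumptions the negative log prior and the L2 penalty with `weight_decay = 1 / prior_var` differ only by a constant that does not depend on `w`, so they yield the same gradients. A smaller prior variance corresponds to stronger weight decay.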
Similar resources
Importance Sampled Stochastic Optimization for Variational Inference
Variational inference approximates the posterior distribution of a probabilistic model with a parameterized density by maximizing a lower bound for the model evidence. Modern solutions fit a flexible approximation with stochastic gradient descent, using Monte Carlo approximation for the gradients. This enables variational inference for arbitrary differentiable probabilistic models, and conseque...
A Variational Analysis of Stochastic Gradient Algorithms
Stochastic Gradient Descent (SGD) is an important algorithm in machine learning. With constant learning rates, it is a stochastic process that, after an initial phase of convergence, generates samples from a stationary distribution. We show that SGD with constant rates can be effectively used as an approximate posterior inference algorithm for probabilistic modeling. Specifically, we show how t...
Tutorial on Variational Autoencoders
In just three years, Variational Autoencoders (VAEs) have emerged as one of the most popular approaches to unsupervised learning of complicated distributions. VAEs are appealing because they are built on top of standard function approximators (neural networks), and can be trained with stochastic gradient descent. VAEs have already shown promise in generating many kinds of complicated data, incl...
Gaussian Processes for Big Data through Stochastic Variational Inference
Gaussian processes [GP 10] are perhaps the dominant approach for inference on functions. They underpin a range of algorithms for regression, classification and unsupervised learning. Unfortunately, exact inference in a GP has complexity O(n^3) with storage demands of O(n^2) and this hinders application of these models for ‘big data’. Various approximate techniques have been suggested [see e.g. 1, 1...
Re-using gradient computations in automatic variational inference
Automatic variational inference has recently become feasible as a scalable inference tool for probabilistic programming. The state-of-the-art algorithms are stochastic in two respects: they use stochastic gradient descent to optimize an expectation that is estimated with stochastic approximation. The core computation of such algorithms involves evaluating the loss and its automatically differen...
Early Stopping as Nonparametric Variational Inference
We show that unconverged stochastic gradient descent can be interpreted as a procedure that samples from a nonparametric approximate posterior distribution. This distribution is implicitly defined by the transformation of an initial distribution by a sequence of optimization steps. By tracking the change in entropy over these distributions during optimization, we form a scalable, unbiased estim...